Very WIP page (as of 5th February 2025); you can help by asking questions in the telegram group or in person on Mondays.
3d scan of tami
The scan is based on a phone video recording of me walking around tami.
Now, to get from a video to a mesh/gaussian splat, the first step is to figure out where all the frames are relative to each other in space (aka global registration).
The first two steps consist of feature detection and matching.
The rolling shutter and overexposure in the video were too much for colmap's SIFT-based feature detection. And given that we have 3014 frames, colmap's exhaustive matching (3014**2/2 ≈ 4.5 million pairs) is way too slow.
We can make assumptions about our data, such as the fact that time moves forward and that the camera moves continuously through space (aka no teleportation), to reduce the number of pairs we have to match.
At first I tried RoMa, the current SOTA model combining feature detection and matching (https://arxiv.org/abs/2305.15404v2), but it was too slow per pair.
After that I switched to SuperPoint for feature detection and LightGlue for feature matching, which seems to be a fairly popular combo currently.
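Matching a single pair with this combo looks roughly like the following (a minimal sketch following the LightGlue README; the frame filenames are placeholders):

<code python>
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = "cuda" if torch.cuda.is_available() else "cpu"
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)  # feature detection
matcher = LightGlue(features="superpoint").eval().to(device)      # feature matching

# two nearby video frames (placeholder filenames)
im0 = load_image("frames/000100.png").to(device)
im1 = load_image("frames/000101.png").to(device)

feats0 = extractor.extract(im0)
feats1 = extractor.extract(im1)
matches01 = matcher({"image0": feats0, "image1": feats1})
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]  # drop batch dim

matches = matches01["matches"]               # (K, 2) keypoint index pairs
pts0 = feats0["keypoints"][matches[..., 0]]  # matched 2D points in frame 0
pts1 = feats1["keypoints"][matches[..., 1]]  # matched 2D points in frame 1
</code>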
First I matched every frame with its 30 nearest frames (by time). Then I manually found places that the video revisits and matched between those frames to serve as loop-closure points. After converting the keypoint and match data into a colmap database, I used colmap's transitive pair generator to complete the matching.
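For illustration, the pair selection looks roughly like this (a sketch, not the exact script; the loop-closure frame indices below are made-up placeholders, the real ones were found by hand):

<code python>
# Build the list of frame pairs to match: a sliding time window
# plus manually found loop closures.
n_frames = 3014
window = 30  # match each frame against its 30 nearest frames in time

pairs = set()
for i in range(n_frames):
    for j in range(i + 1, min(i + 1 + window, n_frames)):
        pairs.add((i, j))

# Loop closures: moments where the video revisits an earlier spot.
# Placeholder indices; the real ones were picked manually.
loop_closures = [(150, 2700), (420, 2950)]
for a, b in loop_closures:
    for da in range(-5, 6):          # small window around each side
        for db in range(-5, 6):
            i, j = sorted((a + da, b + db))
            if 0 <= i and j < n_frames and i != j:
                pairs.add((i, j))

print(f"{len(pairs)} pairs instead of {n_frames * (n_frames - 1) // 2}")
</code>

If I recall correctly, the transitive completion step is exposed as the transitive_matcher command (colmap transitive_matcher --database_path database.db).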
Then I ran GLOMAP on the image pairs to get the global poses and a sparse point cloud.
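For completeness, this step is a single mapper call on the matched database; a sketch via subprocess (paths are placeholders, flags per the GLOMAP README):

<code python>
import subprocess

# Global structure-from-motion: reads the matched colmap database,
# writes global camera poses and a sparse point cloud.
subprocess.run([
    "glomap", "mapper",
    "--database_path", "database.db",  # placeholder path
    "--image_path", "frames/",         # placeholder path
    "--output_path", "sparse/",        # placeholder output dir
], check=True)
</code>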
This scene can already be used for 3d gaussian splatting, but it had a lot of trouble recovering fine details outside the initial sparse point cloud. Random initialization helped, but then it struggled with big floaters and incredibly long training times.
I tried other 3dgs pruning schedules/techniques such as AbsGS and PixelGS, but they gave overall worse results since their repos didn't include depth regularization and antialiasing.
I decided to try making a dense point cloud instead, but colmap's MVS was way too slow to compute.
I tried projecting points from a monocular depth estimate (Apple's Depth Pro), but it was too globally unstable between frames.
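By "projecting points" I mean lifting each depth map into world space through the camera model; a minimal numpy sketch under the usual pinhole assumptions (names are mine, not from any particular library):

<code python>
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-space 3D points.

    depth: per-pixel metric depth, K: 3x3 intrinsics,
    cam_to_world: 4x4 camera-to-world pose (e.g. from GLOMAP).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # pinhole model: x = (u - cx) / fx * z, y = (v - cy) / fy * z
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]
</code>

The instability shows up when merging clouds from many frames: each frame's depth estimate carries its own global error, so the same surface lands at slightly different depths and the merged cloud smears.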
The remaining steps, still to be written up:
  * coarse patchmatch MVS depth estimation
  * alignment of neural monocular estimate to patchmatch (ransac polynomial fitting)
  * dense point cloud integration
  * 3dgs scene refinement
  * 3dgs depth rasterization
  * depth kernel density outlier detection
  * tsdf integration
  * mesh compression
  * voxelization
  * cam16_ucs color space nearest neighbors
  * minetest world format bullshit