Given multi-level masks from the Segment Anything Model (SAM), GARField optimizes a scale-conditioned affinity field that describes how similar different 3D points are. Grouping is a fundamentally ambiguous task: input 2D masks may overlap or conflict within and across views. GARField resolves these ambiguities by conditioning on Euclidean scene scale, selecting different groupings depending on their 3D size. Once trained, GARField can either be queried interactively by providing points along with a scale, or produce globally consistent clusterings automatically at a hierarchy of scales. The resulting groups can be used for downstream tasks such as extraction or scene manipulation.
Given a single 3D point, GARField can select multiple groups by thresholding affinity at different scales!
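To make the thresholding idea concrete, here is a minimal sketch. It assumes a feature field that maps a point and a scale to a feature vector; `toy_feature_field`, `select_group`, and the threshold value are hypothetical names chosen for illustration and are not part of GARField's code. The toy field simply snaps points to a grid whose cell size equals the scale, so larger scales merge more points.

```python
import math

def toy_feature_field(point, scale):
    # Hypothetical stand-in for a trained scale-conditioned field:
    # points in the same grid cell of side `scale` share a feature.
    return [float(math.floor(c / scale)) for c in point]

def select_group(feature_fn, query, points, scale, threshold):
    """Return indices of points whose feature distance to the query is below threshold."""
    q = feature_fn(query, scale)
    return [i for i, p in enumerate(points)
            if math.dist(q, feature_fn(p, scale)) < threshold]

# Four points on a line; the first one is the query.
points = [(0.1,), (0.4,), (0.9,), (2.5,)]
groups = {s: select_group(toy_feature_field, points[0], points, s, threshold=0.5)
          for s in (0.5, 1.0, 4.0)}
```

With this toy field, the same query point yields a progressively larger group as the scale grows, which is the behavior described above.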
GARField's affinity can be recursively clustered to break a scene into smaller and smaller subcomponents automatically.
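The recursion can be sketched roughly as follows: cluster at a coarse scale, then re-cluster each resulting group at the next finer scale. This is only an illustration under stated assumptions. The connected-components grouping, the `distance_fn(p, q, scale)` signature, and all function names are hypothetical, and the clustering algorithm GARField actually uses may differ.

```python
def cluster_at_scale(indices, points, distance_fn, scale, threshold):
    """Connected components: link points whose distance at this scale is below threshold."""
    remaining = list(indices)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster, frontier = [seed], [seed]
        while frontier:
            cur = frontier.pop()
            linked = [i for i in remaining
                      if distance_fn(points[cur], points[i], scale) < threshold]
            for i in linked:
                remaining.remove(i)
            cluster.extend(linked)
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters

def recursive_clusters(indices, points, distance_fn, scales, threshold):
    """Build a nested tree of groups, one level per scale (coarse to fine)."""
    node = {"members": sorted(indices), "children": []}
    if not scales:
        return node
    for cluster in cluster_at_scale(indices, points, distance_fn, scales[0], threshold):
        node["children"].append(
            recursive_clusters(cluster, points, distance_fn, scales[1:], threshold))
    return node

def toy_distance(p, q, scale):
    # Hypothetical affinity proxy: physical distance relative to the query scale.
    return abs(p - q) / scale

points = [0.0, 0.2, 1.0, 1.2, 5.0]
tree = recursive_clusters(range(len(points)), points, toy_distance,
                          scales=[2.0, 0.5], threshold=1.0)
```

Here the coarse scale separates the far point from the other four, and the finer scale splits those four into two pairs.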
GARField can be used to extract 3D-complete assets from casual scene scans, which can then be simulated (for example, dropped) or exploded!
GARField uses a contrastive loss to optimize its feature field. Given pairs of rays in a training batch, rays that land in different 2D masks are pushed apart in feature space, and rays that land in the same 2D mask are pulled together. Affinity between two points is measured by the L2 distance between their features: the smaller the distance, the higher the affinity. Features are conditioned on scale, which is determined by the actual 3D size of each mask, obtained via backprojection.
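As a rough illustration of the pull/push behavior, the sketch below computes a pairwise contrastive loss over per-ray features. The hinge with a `margin` for pairs in different masks is a common contrastive formulation and is an assumption here; the exact loss terms used by GARField may differ, and `pairwise_contrastive_loss` is a hypothetical name. The features are assumed to be already conditioned on scale.

```python
import math

def pairwise_contrastive_loss(features, mask_ids, margin=1.0):
    """Average contrastive loss over all ray pairs in a batch.

    features: list of feature vectors, one per ray.
    mask_ids: list of ints, the 2D mask each ray landed in.
    margin:   assumed hinge margin for pairs in different masks.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            d = math.dist(features[i], features[j])  # L2 distance in feature space
            if mask_ids[i] == mask_ids[j]:
                total += d                      # same mask: pull together
            else:
                total += max(0.0, margin - d)   # different masks: push apart up to margin
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

For instance, two rays in the same mask with features `[0.0, 0.0]` and `[3.0, 4.0]` contribute their full distance of 5.0 to the loss, while the same pair in different masks contributes nothing once the distance exceeds the margin.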
Humans can interpret a scene at multiple levels of granularity. However, this richness also creates ambiguities in grouping, which makes the resulting masks difficult to use as supervision: two watermelon wedges are separate objects, but also parts of the same watermelon. To reconcile these conflicting signals into a single affinity field, GARField embraces the ambiguity through physical scale, allowing a point to belong to different groups of different sizes.
SAM is a 2D approach, so by definition the groupings it suggests are not 3D-consistent across viewpoints and scales. It also allows only 3 possible groupings per point. GARField produces more complete groups by incorporating masks from many different views.
If you use this work or find it helpful, please consider citing:
@inproceedings{garfield2024,
  author = {Kim, Chung Min* and Wu, Mingxuan* and Kerr, Justin* and Tancik, Matthew and Goldberg, Ken and Kanazawa, Angjoo},
  title = {GARField: Group Anything with Radiance Fields},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024},
}