Dense and Continuous Correspondence Distributions
for Object Pose Estimation with Learnt Surface Embeddings
Rasmus Laurvig Haugaard, Anders Glent Buch
SDU Robotics, University of Southern Denmark
We present an approach to learn dense, continuous 2D-3D correspondence distributions over the surface of objects from data with no prior knowledge of visual ambiguities such as symmetry. We also present a new method for 6D pose estimation of rigid objects, which uses the learnt distributions to sample, score and refine pose hypotheses. The correspondence distributions are learnt with a contrastive loss and represented in object-specific latent spaces by an encoder-decoder query model and a small fully connected key model. Although our method is unsupervised with respect to visual ambiguities, we show that the query and key models learn to represent accurate multi-modal surface distributions. Trained purely on synthetic data, our pose estimation method significantly improves the state of the art on the comprehensive BOP Challenge, even compared with methods trained on real data.
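To make the contrastive training concrete, below is a minimal PyTorch sketch of an InfoNCE-style loss between per-pixel query embeddings and per-point key embeddings. The function name, tensor shapes and choice of negatives are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_correspondence_loss(query_emb, key_mlp, surface_pts, gt_pts):
    """InfoNCE-style loss over per-pixel queries and per-point keys (sketch).

    query_emb:   (N, E) query embeddings for N foreground pixels.
    key_mlp:     small MLP mapping 3D object coordinates to E-dim keys.
    surface_pts: (M, 3) points sampled on the object surface (negatives).
    gt_pts:      (N, 3) ground-truth 3D object coordinate for each pixel (positives).
    """
    pos_keys = key_mlp(gt_pts)                                 # (N, E) positive keys
    neg_keys = key_mlp(surface_pts)                            # (M, E) negative keys
    pos_logits = (query_emb * pos_keys).sum(-1, keepdim=True)  # (N, 1)
    neg_logits = query_emb @ neg_keys.t()                      # (N, M)
    logits = torch.cat([pos_logits, neg_logits], dim=1)        # (N, 1 + M)
    # the positive key sits at index 0 for every pixel
    target = torch.zeros(len(query_emb), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```

In practice, the ground-truth 3D coordinate for each pixel would come from rendering the object's coordinates under the ground-truth pose, and the negatives from points sampled on the mesh surface.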
Hover over the input image (left) or query image (middle) to see the estimated correspondence distribution. Drag the keys (right) to rotate them, or drag the slider to interpolate between keys' object coordinates and UMAP projection.
Interesting examples include objects with discrete rotational symmetry, full rotational symmetry, and no symmetry.
The UMAP projection of keys shows the key manifold and primarily provides insight into constant ambiguities, like global symmetry. A query can also represent view-dependent ambiguities, which are best seen by exploring the distributions.
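For reference, the 3D coloring of keys via UMAP could be produced roughly as below, using the umap-learn package; the key array, file name and normalization are illustrative assumptions.

```python
import numpy as np
import umap  # pip install umap-learn

# keys: (M, E) key embeddings, e.g. the key MLP evaluated on M surface points.
keys = np.load("keys.npy")  # hypothetical file; any (M, E) float array works

# project the E-dimensional key manifold to 3D for visualization
proj = umap.UMAP(n_components=3, random_state=0).fit_transform(keys)

# normalize each axis to [0, 1] so the projection can be shown as RGB colors
colors = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-9)
```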
Examples are chosen randomly for each object in the seven BOP Challenge datasets. Models are trained purely on synthetic images and shown on real images, except for ITODD and HB, which are shown on synthetic images because ground-truth poses for their real images are not publicly available.
Distributions may appear worse than they are because of poor ground-truth poses (see the distributions near the edges) or poor meshes.
We use the distributions to sample, score and refine pose hypotheses; a rough sketch of the sample-and-score step is given after the table. Our pose estimation method is trained and evaluated on the seven BOP Challenge datasets, with the per-dataset average recall reported below (higher is better):
Method | Domain | Synth | LM-O | T-LESS | TUD-L | IC-BIN | ITODD | HB | YCB-V | Avg |
---|---|---|---|---|---|---|---|---|---|---|
**Methods using color, trained purely on synthetic images** | | | | | | | | | | |
SurfEmb | RGB | ✓ | 0.656 | 0.741 | 0.715 | 0.585 | 0.387 | 0.793 | 0.653 | 0.647 |
EPOS | RGB | ✓ | 0.547 | 0.467 | 0.558 | 0.363 | 0.186 | 0.580 | 0.499 | 0.457 |
CDPNv2 | RGB | ✓ | 0.624 | 0.407 | 0.588 | 0.473 | 0.102 | 0.722 | 0.390 | 0.472 |
DPODv2 | RGB | ✓ | 0.584 | 0.636 | - | - | - | 0.725 | - | - |
PVNet | RGB | ✓ | 0.575 | - | - | - | - | - | - | - |
CosyPose | RGB | ✓ | 0.633 | 0.640 | 0.685 | 0.583 | 0.216 | 0.656 | 0.574 | 0.570 |
**Methods using depth** | | | | | | | | | | |
SurfEmb | RGB-D | ✓ | 0.758 | 0.828 | 0.854 | 0.656 | 0.498 | 0.867 | 0.806 | 0.752 |
SurfEmb | RGB-D | ✗ | 0.758 | 0.833 | 0.933 | 0.656 | 0.498 | 0.867 | 0.824 | 0.767 |
Drost | RGB-D | * | 0.515 | 0.500 | 0.851 | 0.368 | 0.570 | 0.671 | 0.375 | 0.550 |
Vidal Sensors | D | * | 0.582 | 0.538 | 0.876 | 0.393 | 0.435 | 0.706 | 0.450 | 0.569 |
Koenig-Hybrid | RGB-D | ✗ | 0.631 | 0.655 | 0.920 | 0.430 | 0.483 | 0.651 | 0.701 | 0.639 |
Pix2Pose | RGB-D | ✗ | 0.588 | 0.512 | 0.820 | 0.390 | 0.351 | 0.695 | 0.780 | 0.591 |
CosyPose | RGB-D | ✗ | 0.714 | 0.701 | 0.939 | 0.647 | 0.313 | 0.712 | 0.861 | 0.698 |
Synth: the method is trained purely on synthetic images; note that only T-LESS, TUD-L and YCB-V provide real images for training. *: the method does not use the available training data.
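To make the "sample, score and refine" step referenced above more concrete, the sketch below samples minimal 2D-3D correspondence sets from a learnt distribution and scores the resulting PnP hypotheses. It is an illustrative approximation using OpenCV; the function name, the scoring heuristic and all shapes are assumptions, not the paper's exact procedure, and refinement is omitted.

```python
import numpy as np
import cv2

def sample_and_score_hypotheses(pixels, probs, surface_pts, K, n_hyp=64, n_corr=4):
    """Sample pose hypotheses from a 2D-3D correspondence distribution and
    score them (illustrative sketch, not the paper's exact scheme).

    pixels:      (N, 2) image coordinates of foreground pixels.
    probs:       (N, M) correspondence distribution P(3D point | pixel); rows sum to 1.
    surface_pts: (M, 3) 3D surface points that the distribution is defined over.
    K:           (3, 3) camera intrinsic matrix.
    """
    best_pose, best_score = None, -np.inf
    for _ in range(n_hyp):
        # draw a minimal set of correspondences from the distribution
        pix_idx = np.random.choice(len(pixels), n_corr, replace=False)
        pt_idx = [np.random.choice(len(surface_pts), p=probs[i]) for i in pix_idx]
        ok, rvec, tvec = cv2.solvePnP(
            surface_pts[pt_idx].astype(np.float64),
            pixels[pix_idx].astype(np.float64),
            K.astype(np.float64), None, flags=cv2.SOLVEPNP_AP3P)
        if not ok:
            continue
        # score the hypothesis by how much correspondence probability mass
        # the reprojected surface points explain (a simple proxy score)
        proj, _ = cv2.projectPoints(
            surface_pts.astype(np.float64), rvec, tvec, K.astype(np.float64), None)
        proj = proj.reshape(-1, 2)
        dists = np.linalg.norm(pixels[:, None] - proj[None], axis=-1)  # (N, M)
        nearest = dists.argmin(axis=1)
        score = probs[np.arange(len(pixels)), nearest].sum()
        if score > best_score:
            R, _ = cv2.Rodrigues(rvec)
            best_pose, best_score = (R, tvec), score
    return best_pose, best_score
```

The refinement step and the depth-based (RGB-D) variant reported in the table are omitted from this sketch.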